Model post process for zero stage3 training #17187
Conversation
orttraining/orttraining/python/training/ortmodule/_zero_stage3_compatibility.py
LGTM
Force-pushed from 86704cf to 4e59594
```python
    tensor_input_dtypes: List[torch.onnx.TensorProtoDataType],
) -> Tuple[List[Optional[List[Union[int, str]]]], List[torch.onnx.TensorProtoDataType]]:
    # output = input.matmul(weight.t())
    tensor_input_shapes[0]  # input
```
CodeQL (code scanning) notice: statement has no effect.
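For context, the reviewed snippet appears to belong to a shape-inference routine for a linear layer, where `output = input.matmul(weight.t())`. A minimal sketch of that shape rule follows; the function name and parameters here are illustrative assumptions, not the PR's actual implementation:

```python
from typing import List, Union

def infer_linear_output_shape(
    input_shape: List[Union[int, str]],
    weight_shape: List[Union[int, str]],
) -> List[Union[int, str]]:
    # For output = input.matmul(weight.t()):
    # input is [..., in_features], weight is [out_features, in_features],
    # so the output keeps the leading dims and replaces the last
    # dimension with out_features. Symbolic dims (strings) pass through.
    return input_shape[:-1] + [weight_shape[0]]

# Example: a [8, 16] input projected by a [32, 16] weight.
print(infer_linear_output_shape([8, 16], [32, 16]))  # [8, 32]
```

Note that a bare expression like `tensor_input_shapes[0]` (as flagged by CodeQL) reads a value without using it; a routine like the above would need to bind it to a name before combining it with the weight shape.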
orttraining/orttraining/python/training/ortmodule/_graph_execution_manager.py
LGTM
Thank you so much, @askhade!
### Model post process for zero stage3 training

This is the last change needed to make both single-GPU and multi-GPU runs pass.

Design details: https://microsoft.sharepoint.com/:p:/t/ONNX2/EfNfJ43necpIoPI6x5M2zvYBVbfjoPQmG4Boc_F7-tHm1w?e=ekQwA6&nav=eyJzSWQiOjMxNiwiY0lkIjoxMDE1Nzg3NDZ9

`PyTorch` runs with `ZeROOffloadSubscriber`:

```python
model = prepare_model(...)

from onnxruntime.training.utils.hooks import configure_ort_compatible_zero_stage3
configure_ort_compatible_zero_stage3()
```

`ORTModule` runs with `ZeROOffloadSubscriber`:

```python
os.environ['ORTMODULE_ENABLE_ZERO_STAGE3'] = '1'

from onnxruntime.training.ortmodule import ORTModule
model = ORTModule(self.model)
```

Debugging convergence issues becomes much easier when both ORT and PyTorch can run the same offload path.

### Motivation and Context